-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: refine talos terraform script #2172
Conversation
WalkthroughThe pull request updates the Terraform configuration for AWS and Talos resources, focusing on simplifying machine configurations and resource declarations. Key changes include the removal of Changes
Assessment against linked issues
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
test_framework/terraform/aws/talos/main.tf (1)
272-283
: Excellent addition of cluster health monitoringThe new
talos_cluster_health
data source is a valuable addition that:
- Ensures proper cluster health verification
- Has correct dependency chain with configuration and kubeconfig
- Monitors both control plane and worker nodes
This aligns well with the objective of verifying Longhorn v1.8.0 on Talos v1.8.2.
Consider adding outputs to expose the health status for integration with CI/CD pipelines or monitoring systems.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
test_framework/terraform/aws/talos/main.tf
(3 hunks)
🔇 Additional comments (3)
test_framework/terraform/aws/talos/main.tf (3)
226-226
: Consider impact of standardized configurations
The changes now apply identical configurations to all nodes of the same type. While this simplifies the setup, please verify that:
- This standardization meets Longhorn's requirements for v1.8.0 testing
- There's no need for node-specific configurations in your test scenarios
Also applies to: 235-235
259-269
: LGTM: Improved kubeconfig management
The change from data
to resource
for kubeconfig management is a good improvement as it:
- Provides better lifecycle management of the kubeconfig
- Ensures the kubeconfig is actively managed rather than just read
- Maintains proper dependency chain with
talos_machine_bootstrap
218-218
: Verify the existence and content of talos-patch-worker.yaml
The configuration now references a new worker-specific patch file. Please ensure that:
- The file
talos-patch-worker.yaml
exists in the module directory - The worker-specific configurations are properly defined
✅ Verification successful
Worker patch file exists and is properly configured
The separation of worker and control plane configurations is correctly implemented:
- The worker patch file exists and contains the necessary machine configurations for Longhorn (extra mounts, sysctls, and kernel modules)
- The main difference is that the worker patch excludes control plane specific settings (
machine.type: init
andcluster.allowSchedulingOnControlPlanes
), which is appropriate for worker nodes - All required configurations for Longhorn functionality are preserved in the worker patch
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if the worker patch file exists and compare with the original patch
fd -t f "talos-patch-worker.yaml"
if [ $? -eq 0 ]; then
echo "Content of worker patch file:"
cat talos-patch-worker.yaml
echo -e "\nContent of original patch file:"
cat talos-patch.yaml
fi
Length of output: 416
Script:
#!/bin/bash
# Let's check the actual path found by fd and try to read the file from there
fd -t f "talos-patch-worker.yaml"
if [ $? -eq 0 ]; then
echo "Content of worker patch file:"
cat test_framework/terraform/aws/talos/talos-patch-worker.yaml
fi
# Also check for the original patch file
echo -e "\nChecking for original patch file:"
fd -t f "talos-patch.yaml"
if [ $? -eq 0 ]; then
echo "Content of original patch file:"
cat test_framework/terraform/aws/talos/talos-patch.yaml
fi
Length of output: 1289
6f3a630
to
ecc4cb9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (6)
test_framework/scripts/longhorn-setup.sh (2)
Line range hint
1-502
: Consider structural improvements for maintainability and reliability.The script could benefit from several improvements:
- Move hardcoded URLs to variables at the top of the script
- Add input validation for critical environment variables
- Enhance error handling in installation functions
- Add more detailed logging for troubleshooting
Example improvements:
+ # Configuration + readonly MINIO_BACKUPSTORE_URL="https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/backupstores/minio-backupstore.yaml" + readonly NFS_BACKUPSTORE_URL="https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/backupstores/nfs-backupstore.yaml" + readonly LONGHORNCTL_VERSION="v1.7.2" + # Validation + validate_environment() { + local required_vars=( + "LONGHORN_NAMESPACE" + "TF_VAR_tf_workspace" + "DISTRO" + ) + + for var in "${required_vars[@]}"; do + if [[ -z "${!var}" ]]; then + echo "Error: Required environment variable $var is not set" + exit 1 + fi + done + } main() { + validate_environment set_kubeconfig ... }
Line range hint
32-38
: Add retry mechanism for external resource downloads.The script downloads resources from external URLs without retry mechanisms. Network issues could cause intermittent failures.
Example improvement:
+ download_with_retry() { + local url="$1" + local output="$2" + local max_attempts=3 + local attempt=1 + + while [[ $attempt -le $max_attempts ]]; do + if curl -L --fail "$url" -o "$output"; then + return 0 + fi + echo "Download attempt $attempt failed. Retrying in 5 seconds..." + sleep 5 + attempt=$((attempt + 1)) + done + echo "Failed to download $url after $max_attempts attempts" + return 1 + } install_backupstores(){ - MINIO_BACKUPSTORE_URL="https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/backupstores/minio-backupstore.yaml" - NFS_BACKUPSTORE_URL="https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/backupstores/nfs-backupstore.yaml" - kubectl create -f ${MINIO_BACKUPSTORE_URL} \ - -f ${NFS_BACKUPSTORE_URL} + local tmp_minio="/tmp/minio-backupstore.yaml" + local tmp_nfs="/tmp/nfs-backupstore.yaml" + + if download_with_retry "$MINIO_BACKUPSTORE_URL" "$tmp_minio" && \ + download_with_retry "$NFS_BACKUPSTORE_URL" "$tmp_nfs"; then + kubectl create -f "$tmp_minio" -f "$tmp_nfs" + else + echo "Failed to download backupstore manifests" + exit 1 + fi }Also applies to: 168-171
test_framework/terraform/aws/talos/main.tf (4)
226-226
: LGTM: Simplified machine configuration applicationThe removal of
count.index
frommachine_configuration_input
ensures consistent configuration across all nodes of the same type. This is a good practice that reduces the chance of configuration drift.Note: This change means all nodes of the same type (control plane or worker) will have identical configurations. If you need node-specific configurations in the future, you'll need to modify this approach.
Also applies to: 235-235
259-265
: Consider standardizing array indexing syntaxThe change from
data
toresource
fortalos_cluster_kubeconfig
is good as it indicates active management. However, the array indexing syntax.0
differs from the array syntax[0]
used elsewhere in the file.Consider using consistent array indexing throughout:
- endpoint = aws_instance.lh_aws_instance_controlplane.0.public_ip - node = aws_instance.lh_aws_instance_controlplane.0.private_ip + endpoint = aws_instance.lh_aws_instance_controlplane[0].public_ip + node = aws_instance.lh_aws_instance_controlplane[0].private_ip
272-283
: LGTM: Added cluster health monitoringThe addition of
talos_cluster_health
is excellent for verifying cluster readiness. The dependencies ensure proper sequencing of operations.Consider adding a local variable to capture the health check result and use it in subsequent resources or outputs:
output "cluster_health_status" { description = "Status of the Talos cluster health check" value = data.talos_cluster_health.this.healthy }
Line range hint
1-283
: Architecture Review: Good improvements for cluster managementThe changes enhance the Terraform configuration in several ways:
- Better separation of worker node configurations
- Improved cluster health monitoring
- More robust kubeconfig management
These improvements align well with the goal of verifying Longhorn v1.8.0 on Talos v1.8.2, as they provide better control and visibility into the cluster's state.
Consider adding:
- Terraform outputs for key cluster information
- Timeout settings for the health checks
- Documentation comments explaining the worker-specific configurations needed for Longhorn
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
test_framework/scripts/longhorn-setup.sh
(1 hunks)test_framework/terraform/aws/talos/main.tf
(3 hunks)
🔇 Additional comments (1)
test_framework/terraform/aws/talos/main.tf (1)
218-218
: Verify the new worker patch file content
The configuration now references a new file talos-patch-worker.yaml
instead of talos-patch.yaml
. Please ensure that:
- The new file exists in the module directory
- The worker-specific configurations are compatible with Longhorn v1.8.0 requirements
✅ Verification successful
Worker patch file changes are appropriate and maintain Longhorn compatibility
The separation of worker and control plane patches is correct. The new talos-patch-worker.yaml
maintains all the essential configurations for Longhorn compatibility:
- Required mount points for
/var/lib/longhorn
with correct bind options - Necessary kernel modules (nvme_tcp, vfio_pci, uio_pci_generic)
- Required sysctl settings (vm.nr_hugepages)
The only differences are the removal of control-plane specific settings:
- Removed
type: init
(control plane specific) - Removed
allowSchedulingOnControlPlanes: false
(cluster-wide setting)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if the new patch file exists and compare with the old one
if [ -f "test_framework/terraform/aws/talos/talos-patch-worker.yaml" ]; then
echo "New worker patch file exists"
if [ -f "test_framework/terraform/aws/talos/talos-patch.yaml" ]; then
diff test_framework/terraform/aws/talos/talos-patch.yaml test_framework/terraform/aws/talos/talos-patch-worker.yaml
fi
else
echo "ERROR: talos-patch-worker.yaml not found"
exit 1
fi
Length of output: 402
Script:
#!/bin/bash
# Let's examine the content of both files to better understand the changes
echo "=== Content of talos-patch-worker.yaml ==="
cat test_framework/terraform/aws/talos/talos-patch-worker.yaml
echo -e "\n=== Content of talos-patch.yaml ==="
cat test_framework/terraform/aws/talos/talos-patch.yaml
Length of output: 1038
sleep 86400 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove or justify the 24-hour sleep command.
The addition of a 24-hour sleep command (sleep 86400
) at the beginning of the main function will cause significant delays in CI/testing pipelines. This appears to be a debugging artifact that should be removed.
Apply this diff to remove the sleep command:
- sleep 86400
be8f920
to
594455e
Compare
Signed-off-by: Yang Chiu <[email protected]>
594455e
to
0e8c957
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
test_framework/terraform/aws/talos/variables.tf (1)
34-34
: Update documentation to reflect version changeAs mentioned in the PR objectives, please ensure the Longhorn documentation, particularly the best practices section related to the operating system, is updated to reflect this new Talos version requirement.
#!/bin/bash # Description: Check for documentation files that might need updates # Search for documentation files mentioning Talos versions echo "Searching for documentation files mentioning Talos versions..." rg --type md "talos.*1\.[0-9]"test_framework/terraform/aws/talos/main.tf (2)
255-261
: Good architectural change: Enhanced kubeconfig managementConverting
talos_cluster_kubeconfig
from a data source to a resource type improves the infrastructure management by:
- Enabling active management of the kubeconfig
- Making the resource lifecycle more explicit
- Better integrating with Terraform's state management
The syntax updates for endpoint reference and content handling are appropriate for this change.
Also applies to: 264-265
Line range hint
214-279
: Overall: Well-structured infrastructure improvementsThe changes in this PR demonstrate thoughtful improvements to the Talos infrastructure:
- Separated worker configurations for better maintainability
- Simplified resource references
- Enhanced kubeconfig management
- Added health monitoring
These changes provide a more robust foundation for verifying Longhorn v1.8.0 on Talos v1.8.2.
However, please ensure that:
- The new
talos-patch-worker.yaml
file is included in the PR- The Talos provider version supports all the new features
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
test_framework/terraform/aws/talos/main.tf
(3 hunks)test_framework/terraform/aws/talos/variables.tf
(1 hunks)
🔇 Additional comments (5)
test_framework/terraform/aws/talos/variables.tf (2)
34-34
: Verify compatibility with significant version jump
The change from Talos 1.7.6 to 1.8.3 represents a significant version jump. Please ensure:
- All Longhorn features are compatible with this version
- No breaking changes affect the test infrastructure
- Required Talos configurations are updated accordingly
Would you like me to help create a GitHub issue to track the compatibility verification tasks?
34-34
:
Version mismatch with PR objectives
The PR objectives mention testing Longhorn v1.8.0 on Talos v1.8.2, but this change sets the version to 1.8.3. Please verify if this version bump is intentional and aligns with the testing requirements.
✅ Verification successful
Version update is acceptable for testing purposes
The version change from 1.7.6 to 1.8.3 in the test framework is acceptable because:
- The test framework is designed to verify Longhorn compatibility with different Talos versions
- Testing with the latest minor version (1.8.3) provides better coverage than testing with 1.8.2
- No version-specific dependencies or configurations were found in the codebase that would be affected by this change
- The Talos configurations in aws.yaml and patch files are version-agnostic
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any version-specific configurations or dependencies in the codebase
# Search for any hardcoded Talos version references
echo "Searching for Talos version references..."
rg -g '!variables.tf' "1\.8\.[0-9]"
# Check if there are any version-specific configuration files
echo "Checking for version-specific configurations..."
fd -g "*1.8*.yaml" -g "*1.8*.tf"
Length of output: 366
Script:
#!/bin/bash
# Let's try a different approach to find version references and related files
# Search for Talos version mentions in markdown files and configs
echo "Searching for Talos version mentions in documentation..."
rg -g "*.md" -g "*.yaml" -g "*.tf" "talos.*1\.8\.[0-9]" -i
# Look for any version-related variables or configurations
echo "Searching for version-related configurations..."
rg "version.*=.*1\.8\.[0-9]"
# Check the PR description or related docs for version requirements
echo "Checking documentation files in test framework..."
fd -t f -e md -e yaml -e tf . test_framework/
Length of output: 4380
Script:
#!/bin/bash
# Let's check the specific Talos configuration files and PR description
# Check the Talos configuration files for version-specific settings
echo "Checking Talos configuration files..."
cat test_framework/terraform/aws/talos/aws.yaml test_framework/terraform/aws/talos/talos-patch.yaml test_framework/terraform/aws/talos/talos-patch-worker.yaml
# Check main.tf for any version dependencies
echo "Checking main.tf for version dependencies..."
cat test_framework/terraform/aws/talos/main.tf
# Check if there are any version constraints in the variables file
echo "Checking complete variables file..."
cat test_framework/terraform/aws/talos/variables.tf
Length of output: 11938
test_framework/terraform/aws/talos/main.tf (3)
222-222
: LGTM: Simplified machine configuration references
Good refactoring to use a single shared configuration for all nodes of the same type. This change improves code maintainability while preserving the intended functionality.
Also applies to: 231-231
214-214
: Verify the existence and content of talos-patch-worker.yaml
The worker configuration now references a new patch file talos-patch-worker.yaml
. Please ensure this file exists and contains the appropriate worker-specific configurations.
✅ Verification successful
Worker patch file exists but has critical differences from control plane patch
The verification shows that talos-patch-worker.yaml
exists but has important differences from talos-patch.yaml
:
- Removed
type: init
configuration - Removed control plane scheduling restrictions (
allowSchedulingOnControlPlanes: false
)
These changes appear intentional and align with worker node requirements, as:
- Worker nodes don't need initialization type (they join rather than initialize)
- Worker nodes don't need control plane scheduling configurations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if the worker patch file exists and compare with the original patch file
# Check if the new worker patch file exists
if [ -f "test_framework/terraform/aws/talos/talos-patch-worker.yaml" ]; then
echo "Worker patch file exists"
# Compare with original patch file if it exists
if [ -f "test_framework/terraform/aws/talos/talos-patch.yaml" ]; then
echo "Differences between patch files:"
diff test_framework/terraform/aws/talos/talos-patch.yaml test_framework/terraform/aws/talos/talos-patch-worker.yaml
fi
else
echo "ERROR: Worker patch file not found"
exit 1
fi
Length of output: 469
268-279
: Excellent addition: Cluster health monitoring
The new talos_cluster_health
data source is a valuable addition that:
- Ensures cluster readiness before proceeding
- Monitors both control plane and worker nodes
- Properly depends on all required resources
This aligns well with the objective of verifying Longhorn v1.8.0 on Talos v1.8.2.
✅ Verification successful
Cluster health monitoring implementation verified
The implementation is correctly configured with:
- Talos provider version >= 0.4.0 which supports the cluster health data source
- Proper dependency chain ensuring health checks run after cluster configuration
- Comprehensive node monitoring covering both control plane and worker nodes
- No downstream dependencies on the health check, which is appropriate as it's used as a verification step
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify Talos provider version supports cluster health monitoring
# Check Talos provider version in lock file
if [ -f ".terraform.lock.hcl" ]; then
echo "Checking Talos provider version..."
grep -A 5 "provider \"registry.terraform.io/siderolabs/talos\"" .terraform.lock.hcl
fi
# Check if there are any existing health check implementations
echo "Checking for existing health checks..."
rg -l "talos_cluster_health" --type tf
Length of output: 201
Script:
#!/bin/bash
# Let's check the provider version requirements and any existing health check configurations
# Check provider version requirements
echo "Checking provider version requirements..."
rg "required_providers" -A 5 --type tf test_framework/terraform/aws/talos/
# Check the context around health check implementation
echo -e "\nChecking health check implementation context..."
rg "talos_cluster_health" -B 5 -A 10 --type tf
Length of output: 2029
Script:
#!/bin/bash
# Let's check the Talos provider version and any outputs that might depend on the health check
# Check Talos provider version specification
echo "Checking Talos provider version..."
rg "talos = {" -A 3 --type tf test_framework/terraform/aws/talos/
# Check if health check is used in any outputs or dependencies
echo -e "\nChecking health check usage in outputs..."
rg "depends_on.*talos_cluster_health" --type tf
rg "output.*talos_cluster_health" -A 5 --type tf
Length of output: 428
Script:
#!/bin/bash
# Let's try a different approach to check the Talos provider version and health check usage
# Check Talos provider version with proper escaping
echo "Checking Talos provider version..."
rg "source.*siderolabs/talos" -A 2 --type tf
# Check for any resources or outputs that reference the health check
echo -e "\nChecking health check references..."
rg "\"talos_cluster_health\"" -C 5 --type tf
# Check if there are any other resources depending on the health check
echo -e "\nChecking dependencies on health check..."
rg "depends_on.*cluster_health" --type tf
Length of output: 1395
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9764
What this PR does / why we need it:
refine talos terraform script
Special notes for your reviewer:
Additional documentation or context
Summary by CodeRabbit
New Features
Improvements
Bug Fixes